• Fall 2022, DSPA (HS650)
  • Name: Pratap Gude & Shivangi Kumar
  • SID: #### - 2799 & #### - 7245
  • UMich E-mail: or
  • We certify that the following paper represents our own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.

1. Abstract

Chronic heart failure carries an average survival of less than one year after diagnosis, so predicting life span from patient characteristics is essential. Some sources suggest that patients may have, on average, 5 years post-diagnosis [Final Stages of Heart Failure: End-Stage Heart Failure, 2020]; such estimates are also crucial for increasing patient interactions and moving patients to palliative and supportive care that improves their quality of life. The Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) database is a publicly available resource for intensive care research. Using it, we included factors such as age group and SAPS score to relate mortality rates to factors associated with chronic heart failure (CHF).

2. Background

Dataset: The dataset is derived from MIMIC-II, a publicly accessible critical care database. It contains a summary of clinical data and outcomes for 1,776 patients. The dataset (full_cohort_data.csv) is a comma-separated values file that includes a header with descriptive variable names.

To Access the Dataset: Clinical data from the MIMIC-II database for a case study on indwelling arterial catheters, https://physionet.org/content/mimic2-iaccd/1.0/

Primary Usage of the Dataset: The primary use of this dataset is to carry out the case study in Chapter 16 of Secondary Analysis of Electronic Health Records. The case study walks the reader through the process of examining the effect of indwelling arterial catheters (IAC) on 28-day mortality in the intensive care unit (ICU) in patients who were mechanically ventilated during the first day of ICU admission.

Strengths and Weaknesses of the Dataset: The dataset comes from MIMIC, a reputable source of medical data, which supports its trustworthiness. It is openly available and can be accessed by anyone. It is extensive, spanning numerous attributes such as physiological parameters, body constituents, and disease presence. The data dictionary is self-explanatory. Most importantly, the data contain no missing values, and this completeness is a major advantage for any analysis. Despite its cleanliness and completeness, the dataset has only 1,776 instances, which is few for in-depth analysis and model building. With more patients recorded, the subsequent study and its findings would be more inclusive and meaningful.

SAPS Scores: SAPS III admission scores categorize patients by their risk of the worst prognosis. The scoring criteria fall into three categories: patient characteristics before ICU admission, circumstances of ICU admission, and the presence and degree of physiological derangement at ICU admission. Based on these criteria, patients are categorized and scored to support prompt, optimized care that addresses patient factors and satisfaction rates.

Population: Patients in the MIMIC data with a SAPS score between 5 and 15. The SAPS score reflects the patient's risk of mortality in the ICU based on the severity of the disease condition.

Intervention or Exposure Variable: Congestive heart failure (chf_flg) is a binary variable where 0 indicates the absence of congestive heart failure and 1 indicates its presence.

Comparison: We aim to compare patients with and without congestive heart failure. Congestive heart failure and chronic renal disease had a correlation of 0.2475 with mortality (relatively higher than the other variables in the dataset), which led us to choose congestive heart failure as the exposure variable and chronic renal disease as the confounder.

Outcome Variable: The outcome variable is censored or death (censor_flg), a binary variable that indicates death when equal to 0 and censoring when equal to 1. Because the SAPS score is an indication of mortality, mortality was the more relevant choice of outcome variable.

Confounder(s): There are various confounders in the dataset. Among the categorical variables, chronic renal disease (renal_flg) was chosen. It is medically observed that having chronic kidney disease (CKD) implies a greater chance of having heart disease [American Kidney Fund. (2022, February 15)]. CKD can cause heart disease, and heart disease can cause CKD. In fact, heart disease is the most common cause of death among people on dialysis. As a confounder, renal disease can affect both the exposure variable (heart disease) and the outcome variable (mortality). Other confounders include the continuous variable hemoglobin count (hgb_first), which is taken at the time of a patient's admission to the ICU. Reduced hemoglobin in patients with congestive heart failure (CHF) has been shown to be independently associated with an increased risk of hospitalization and all-cause mortality. Findings suggest a linear association between reduced hemoglobin and increased mortality risk. In studies that analyzed hemoglobin as a continuous variable, a 1-g/dL decrease in hemoglobin was independently associated with significantly increased mortality risk [Tang, Y. D., & Katz, S. D. (2006)].

3. Methodology

After initial data import and setup, we sorted the data, cleaned it, and checked for missingness using the naniar package. We performed exploratory data analysis on various variables (both categorical and continuous) to formulate the question of interest. We then used clustering to find the optimal number of clusters, PCA to check the dimensionality of the dataset, and feature selection with the Boruta package before moving on to model selection and performance evaluation.

Question of Interest: To find the mortality rate, by age group, for ICU patients with SAPS scores between 5 and 15, with and without heart disease.

3.1 Finding Missingness

## [1] 0
  • No missingness - no imputations required
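A count like the one above can also be obtained in base R. The sketch below is only an illustration under stated assumptions: the report itself used the naniar package, and the data frame `toy` is a made-up stand-in, not the cohort data.

```r
# Hypothetical stand-in for the cohort data; not real MIMIC-II values.
toy <- data.frame(
  age       = c(68.2, 45.1, 80.3),
  chf_flg   = c(0, 1, 0),
  hgb_first = c(12.1, 10.4, 13.0)
)

# Total number of missing cells; 0 means no imputation is needed.
n_missing <- sum(is.na(toy))

# Per-column counts help locate missingness when the total is non-zero.
missing_by_column <- colSums(is.na(toy))

print(n_missing)
print(missing_by_column)
```

If the total were non-zero, the per-column counts would show which variables need attention before modeling.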

3.2 2D Histogram

Interpretation & Analysis: Based on our preliminary data exploration and visualization, we found that higher WBC counts, which correlate with immunity in the literature, relate to fewer ICU stay days. Most patients with chronic diseases admitted to the ICU who had an initial WBC count between 0 and 30 on the first day of ICU admission stayed in the ICU for a greater number of days as compared to those who had an initial WBC count greater than 30 on the first day of ICU admission.

3.3 Correlational Plot

##                          age  gender_num  sapsi_first      chf_flg  censor_flg
## age               1.00000000 -0.13808683  0.217115435  0.285634843 -0.41506929
## gender_num       -0.13808683  1.00000000 -0.067136553 -0.067469341  0.03413822
## sapsi_first       0.21711543 -0.06713655  1.000000000  0.061033686 -0.16957742
## chf_flg           0.28563484 -0.06746934  0.061033686  1.000000000 -0.16028429
## censor_flg       -0.41506929  0.03413822 -0.169577419 -0.160284289  1.00000000
## renal_flg         0.14029191  0.05972044  0.100854089  0.257320151 -0.04454071
## wbc_first        -0.13362334  0.04053159  0.016360147 -0.064126509  0.01257482
## hgb_first        -0.09391657  0.11932993 -0.157179505 -0.110309782  0.08484022
## icu_los_day       0.01159406  0.04297427  0.066149587  0.088013720 -0.05456521
## hospital_los_day -0.07099685  0.05962433  0.001499691 -0.002386556  0.09456228
##                    renal_flg    wbc_first    hgb_first  icu_los_day
## age               0.14029191 -0.133623339 -0.093916573  0.011594062
## gender_num        0.05972044  0.040531586  0.119329935  0.042974266
## sapsi_first       0.10085409  0.016360147 -0.157179505  0.066149587
## chf_flg           0.25732015 -0.064126509 -0.110309782  0.088013720
## censor_flg       -0.04454071  0.012574818  0.084840225 -0.054565214
## renal_flg         1.00000000 -0.105679041 -0.077274127 -0.025475375
## wbc_first        -0.10567904  1.000000000  0.031932388  0.009156714
## hgb_first        -0.07727413  0.031932388  1.000000000  0.046753674
## icu_los_day      -0.02547538  0.009156714  0.046753674  1.000000000
## hospital_los_day -0.02339171 -0.035493384 -0.005714846  0.565693453
##                  hospital_los_day
## age                  -0.070996847
## gender_num            0.059624334
## sapsi_first           0.001499691
## chf_flg              -0.002386556
## censor_flg            0.094562283
## renal_flg            -0.023391715
## wbc_first            -0.035493384
## hgb_first            -0.005714846
## icu_los_day           0.565693453
## hospital_los_day      1.000000000

Interpretation & Analysis: We plotted the correlation matrix to find the variable with the highest correlation with the number of days spent in the ICU. The number of days spent in the ICU and the number of days in the hospital were the most strongly correlated pair (r = 0.57).
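The lookup described here can be sketched in a few lines of base R. The data below are simulated for illustration (the hospital stay is constructed from the ICU stay, so the two correlate by design); only the column names are taken from the report.

```r
set.seed(1)

# Hypothetical data: hospital stay is built from ICU stay plus extra days.
icu_los_day      <- rexp(200, rate = 1 / 3)
hospital_los_day <- icu_los_day + rexp(200, rate = 1 / 4)
age              <- runif(200, 20, 90)
toy <- data.frame(age, icu_los_day, hospital_los_day)

corr <- cor(toy)  # Pearson correlation matrix

# Correlations of every other variable with icu_los_day, strongest first.
icu_corr <- corr["icu_los_day", colnames(corr) != "icu_los_day"]
icu_corr <- icu_corr[order(abs(icu_corr), decreasing = TRUE)]
print(round(icu_corr, 3))
```

Sorting by absolute value keeps strong negative correlations from being overlooked.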

3.4 3-D Scatter Plot

Interpretation & Analysis: The correlation matrix showed a strong association between the number of days spent in the ICU and the number of days in the hospital. This led us to draw a 3D scatter plot relating age and mortality to the length of stay in the ICU and hospital. The green dots are “alive” patients, and the blue dots are “dead” patients. Patients with chronic diseases who are older than 80 tend to have shorter stays in the hospital and ICU and are more likely to die. Patients aged 20 to 50 who are admitted to the ICU have varying lengths of stay in the ICU and hospital and are likely to be alive after ICU admission.

3.5 Multi-variate Bar Plot

Interpretation & Analysis: After dividing the dataset into age groups below and above 60, with and without heart disease, we found that mortality rates are higher in the above-60 age group in general and especially among those with heart disease.
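A grouped mortality rate of this kind can be computed as sketched below. The records are invented, and the definitions of `agegroup` (cut strictly above 60) and `mortality` (1 minus censor_flg) are assumptions made for the example, not the report's confirmed derivations.

```r
# Hypothetical records; censor_flg == 0 marks death, as in the report.
toy <- data.frame(
  age        = c(45, 52, 58, 63, 71, 77, 82, 88),
  chf_flg    = c(0, 1, 0, 0, 1, 1, 0, 1),
  censor_flg = c(1, 1, 1, 1, 0, 0, 1, 0)
)

# Assumed derivations: age group split at 60, mortality as 1 = died.
toy$agegroup  <- ifelse(toy$age > 60, "above 60", "60 or below")
toy$mortality <- 1 - toy$censor_flg

# The mean of a 0/1 indicator is the death rate within each cell.
rates <- tapply(toy$mortality, list(toy$agegroup, toy$chf_flg), mean)
print(rates)
```

The rows of `rates` are the age groups and the columns are the chf_flg values, which maps directly onto a grouped bar plot.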

3.6 k-means Clustering

## K-means clustering with 3 clusters of sizes 113, 145, 498
## 
## Cluster means:
##   gender_num sapsi_first   chf_flg censor_flg  renal_flg wbc_first hgb_first
## 1  0.6017699    17.76106 0.1504425  0.6725664 0.05309735 11.682035  12.25841
## 2  0.5241379    18.19310 0.1241379  0.6068966 0.02068966 22.245517  11.92897
## 3  0.5060241    17.69277 0.1847390  0.5702811 0.06224900  9.735984  11.95060
##   icu_los_day hospital_los_day
## 1    9.340708        23.823009
## 2    3.718828         7.682759
## 3    3.140502         6.433735
## 
## Clustering vector:
##   [1] 3 2 3 3 1 1 3 2 2 3 1 3 2 3 1 3 3 3 3 3 3 3 3 3 3 3 3 2 3 1 1 3 3 3 3 3 3
##  [38] 1 1 1 2 3 3 3 3 1 1 3 3 2 3 1 3 3 2 3 3 1 2 3 3 3 3 2 3 3 2 3 3 2 3 1 3 3
##  [75] 3 3 3 3 1 3 3 3 3 3 2 3 3 2 1 3 3 2 3 3 3 1 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 2 3 3 2 3 3 3 3 1 1 3 3 3 3 2 2 2 2 3 3 1 2 3 3 3 3 3 1 1 3 3 2 3 3 3 1
## [149] 1 3 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 1 3 3 3 2 3 3 3 3 3 3 2 3 3 3 3 3 3
## [186] 2 3 3 1 3 3 3 3 3 1 3 2 3 3 1 1 3 3 3 3 3 2 3 2 3 1 3 3 3 3 3 1 1 1 3 2 1
## [223] 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 2 3 3 1 2 2 3 3 3 3 1 3 3 3 3 1 2
## [260] 3 3 3 3 3 3 1 2 1 3 1 2 2 3 3 1 3 2 2 3 2 3 1 2 2 3 2 3 3 2 2 3 2 1 2 3 3
## [297] 3 2 1 3 3 3 1 3 3 3 2 3 2 2 3 3 3 3 3 3 2 3 3 2 2 3 3 3 3 3 1 2 2 2 3 3 3
## [334] 3 3 3 3 3 2 3 3 3 3 2 3 2 1 3 3 2 3 3 3 3 2 1 3 3 3 3 2 3 3 3 3 3 3 1 3 3
## [371] 2 3 3 2 1 1 2 1 3 3 3 3 1 3 2 1 2 3 2 3 3 2 3 3 3 1 3 1 3 2 2 3 3 3 2 2 3
## [408] 3 3 2 1 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 1 3 2 3 2 3 3 3 1 1 2 3
## [445] 1 3 1 3 3 3 3 3 3 2 3 3 3 3 2 1 2 1 3 1 3 3 3 3 2 1 1 2 2 3 3 3 3 3 3 3 2
## [482] 2 3 1 3 3 2 3 3 3 3 1 2 3 3 1 2 3 2 3 3 3 3 2 3 3 3 3 3 3 1 1 1 1 3 3 1 1
## [519] 2 3 3 1 3 1 2 3 2 3 1 3 3 2 3 1 3 3 3 3 3 3 1 1 3 2 1 3 3 2 2 3 3 2 1 1 1
## [556] 3 3 3 3 3 3 3 3 3 3 3 3 1 3 2 3 2 3 1 3 3 3 3 3 3 2 3 1 3 1 3 3 3 2 3 3 3
## [593] 3 2 3 3 3 3 1 2 3 1 2 3 3 3 3 2 3 2 3 3 3 3 3 2 3 3 3 1 3 3 3 3 3 3 2 2 3
## [630] 3 3 3 3 2 3 1 3 3 1 2 3 3 3 2 2 3 1 2 3 1 3 3 3 1 3 3 3 3 1 1 2 3 3 3 3 3
## [667] 3 2 3 2 3 3 3 3 2 3 3 3 3 3 3 3 3 3 2 3 2 1 3 3 1 2 3 3 1 2 3 3 3 3 3 1 1
## [704] 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 2 2 3 3 2 3 3 3 1 3 2 3 2 2 1 1 3 1 2 3 1
## [741] 3 3 3 3 3 2 3 3 3 3 3 3 2 1 3 1
## 
## Within cluster sum of squares by cluster:
## [1] 23957.65 20561.73 20291.51
##  (between_SS / total_SS =  43.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Interpretation & Analysis: The bar plot for the k-means clustering algorithm displays how much of the data is represented using 1, 2, or 3 clusters. It appears that 2 or 3 clusters represent the data quite adequately. In addition, the silhouette plot indicates that the optimal number of clusters for representing the data is 2.
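The quantity behind such a comparison is the between_SS / total_SS ratio printed in the k-means output. A minimal sketch on simulated two-group data follows; the data and the range of k are assumptions for illustration.

```r
set.seed(42)

# Hypothetical two-group data standing in for the scaled cohort variables.
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

# Share of total variance explained (between_SS / total_SS) for k = 1..4.
explained <- sapply(1:4, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 10)
  fit$betweenss / fit$totss
})
print(round(explained, 3))
```

A large jump from k = 1 to k = 2 followed by small gains afterwards is the usual sign that two clusters already describe the data well.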

3.7 PCA

## Importance of components:
##                          PC1    PC2    PC3    PC4    PC5    PC6     PC7     PC8
## Standard deviation     1.378 1.0687 1.0545 1.0031 0.9767 0.9501 0.90159 0.80364
## Proportion of Variance 0.211 0.1269 0.1235 0.1118 0.1060 0.1003 0.09032 0.07176
## Cumulative Proportion  0.211 0.3379 0.4615 0.5733 0.6793 0.7796 0.86988 0.94165
##                            PC9
## Standard deviation     0.72470
## Proportion of Variance 0.05835
## Cumulative Proportion  1.00000

Interpretation & Analysis: The score plots show the projection of the data onto the span of the principal components. Scores further out are either outliers or naturally extreme observations. Most of the data points have a first PC score near 0, whereas a few have a first PC score of about -2.
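The "Proportion of Variance" row in the output above follows directly from the standard deviations. The short check below recomputes it from the printed values; the 80% threshold is an arbitrary choice for illustration.

```r
# Standard deviations copied from the "Importance of components" output.
sdev <- c(1.378, 1.0687, 1.0545, 1.0031, 0.9767, 0.9501,
          0.90159, 0.80364, 0.72470)

# Each component's share of variance is its squared sdev over the total.
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

print(round(prop_var, 4))
print(round(cum_var, 4))

# Smallest number of components reaching 80% cumulative variance.
n_needed <- which(cum_var >= 0.80)[1]
print(n_needed)
```

Because the first two components account for only about 34% of the variance and seven are needed to pass 80%, the variables carry little redundancy and PCA offers limited dimensionality reduction here.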

3.8 Feature Selection

## Boruta performed 99 iterations in 31.98341 secs.
##  8 attributes confirmed important: age, agegroup, censor_flg,
## hgb_first, icu_los_day and 3 more;
##  1 attributes confirmed unimportant: gender_num;
##  3 tentative attributes left: chf_flg, heart_failure, hospital_los_day;

Using tentative and confirmed important attributes: we select sapsi_first, heart_failure, mortality

##           age gender_num  chf_flg censor_flg renal_flg wbc_first hgb_first
## [1,] 12.12949 -0.9084812 1.762399   6.906130  4.389413  4.161883  7.152791
## [2,] 13.14845  2.2868219 2.122811   8.523155  4.142409  3.319650  7.362015
## [3,] 11.18096  0.6705264 3.687560   4.390831  4.074064  3.197270  5.621322
## [4,] 12.27479  0.8830721 2.278991   7.654170  5.412227  3.430496  7.397266
## [5,] 12.29947  0.2566831 1.720871   6.827953  7.079582  3.454351  6.018177
## [6,] 11.96974 -0.6589906 2.050596   5.877351  4.497742  2.564088  8.042439
##      icu_los_day hospital_los_day agegroup heart_failure mortality
## [1,]    5.499151        4.8730066 6.538453     2.7255033  7.613152
## [2,]    3.897263        1.5475419 6.804577     2.3191152 10.827528
## [3,]    5.478581        0.9876445 6.977166     2.5511299  7.436863
## [4,]    6.337409        3.7439084 6.493189     0.6335887  8.681894
## [5,]    4.225862        1.1968753 8.459238     1.9928457  6.464950
## [6,]    6.070956        0.1418116 8.267406     0.5236195  5.262396
## Boruta performed 99 iterations in 31.98341 secs.
## Tentatives roughfixed over the last 99 iterations.
##  10 attributes confirmed important: age, agegroup, censor_flg, chf_flg,
## heart_failure and 5 more;
##  2 attributes confirmed unimportant: gender_num, hospital_los_day;

Interpretation & Analysis: We excluded all the rejected features, which have infinite importance, from our analysis. We then sorted the non-rejected (important) features by their median importance and plotted them as boxplots using plotly. The box-and-whisker plot shows each variable's median, quartiles, minimum, and maximum, which helps decide which variables are tentative and which are important. The graph also shows the range of importance scores within a single variable. It may be desirable to drop the tentative features.

3.9 Modeling and model evaluation

3.9.1 Normalizing Data

  • Checking if normalization worked
summary(without_conf_n$sapsi_first)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.05882 0.11765 0.16464 0.23529 1.00000
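A summary with a minimum of 0 and a maximum of 1 is what min-max rescaling produces. The helper below is a sketch of that technique; the function name `normalize` and the toy scores are assumptions, since the report does not show its own normalization code.

```r
# Min-max rescaling to [0, 1]; assumes x is numeric and not constant.
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical SAPS-like scores, not taken from the cohort.
sapsi_toy <- c(3, 4, 5, 7, 20)
sapsi_n   <- normalize(sapsi_toy)

# After rescaling, Min. should be 0 and Max. should be 1.
print(summary(sapsi_n))
```

With a raw range of 17, as in this toy vector, a one-point step becomes 1/17 on the normalized scale.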

3.9.2 Partition data

3.9.3 Rpart

## n= 613 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 613 240 1 (0.3915171 0.6084829)  
##     2) age>=0.5790757 347 153 0 (0.5590778 0.4409222)  
##       4) age>=0.8433225 66  16 0 (0.7575758 0.2424242)  
##         8) icu_los_day>=0.01825741 59  11 0 (0.8135593 0.1864407) *
##         9) icu_los_day< 0.01825741 7   2 1 (0.2857143 0.7142857) *
##       5) age< 0.8433225 281 137 0 (0.5124555 0.4875445)  
##        10) age>=0.6998215 171  74 0 (0.5672515 0.4327485)  
##          20) icu_los_day>=0.1878163 34   8 0 (0.7647059 0.2352941) *
##          21) icu_los_day< 0.1878163 137  66 0 (0.5182482 0.4817518)  
##            42) icu_los_day< 0.06381056 70  26 0 (0.6285714 0.3714286) *
##            43) icu_los_day>=0.06381056 67  27 1 (0.4029851 0.5970149) *
##        11) age< 0.6998215 110  47 1 (0.4272727 0.5727273)  
##          22) age< 0.6716549 71  35 0 (0.5070423 0.4929577)  
##            44) icu_los_day< 0.1885394 55  23 0 (0.5818182 0.4181818)  
##              88) age>=0.642669 22   6 0 (0.7272727 0.2727273) *
##              89) age< 0.642669 33  16 1 (0.4848485 0.5151515)  
##               178) age< 0.6224703 21   8 0 (0.6190476 0.3809524) *
##               179) age>=0.6224703 12   3 1 (0.2500000 0.7500000) *
##            45) icu_los_day>=0.1885394 16   4 1 (0.2500000 0.7500000) *
##          23) age>=0.6716549 39  11 1 (0.2820513 0.7179487) *
##     3) age< 0.5790757 266  46 1 (0.1729323 0.8270677) *

Interpretation & Analysis: We split the dataset so that 80% is training data and 20% is test data, then used rpart to construct the classification tree. The tree above shows the features the algorithm used to classify observations. The variables age and icu_los_day emerge as the most important for recursive partitioning. Because the predictors were normalized, the split points are on the 0-1 scale: for normalized age of at least 0.58 (the root split), most patients fall into terminal nodes that predict “0”, indicating a higher number of deaths among the older patients in the ICU.
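An 80/20 partition can be drawn with base R as sketched below. The seed, the toy data frame, and the use of `sample()` are assumptions for illustration; the report does not show which routine produced its own split.

```r
set.seed(123)  # arbitrary seed so the split is reproducible

# Hypothetical data frame standing in for the normalized cohort.
toy <- data.frame(id = 1:100, censor_flg = rbinom(100, 1, 0.6))

# Draw 80% of the row indices for training; the rest form the test set.
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
training  <- toy[train_idx, ]
testing   <- toy[-train_idx, ]

print(c(train = nrow(training), test = nrow(testing)))
```

Sampling row indices without replacement guarantees that no observation appears in both sets, which keeps the test-set evaluation honest.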

3.9.4 K Nearest Neighbours

##                     classifier_knn
##                       0 0.0588235294117647 0.117647058823529 0.176470588235294
##   0                  17                  8                 3                 0
##   0.0588235294117647 17                  8                 5                 3
##   0.117647058823529   7                  9                 2                 2
##   0.176470588235294   0                  3                 6                 3
##   0.235294117647059   0                  5                 1                 6
##   0.294117647058824   0                  0                 0                 1
##   0.352941176470588   0                  0                 0                 1
##   0.411764705882353   0                  0                 0                 1
##   0.470588235294118   0                  0                 0                 0
##   0.529411764705882   0                  0                 0                 0
##   0.588235294117647   0                  0                 0                 0
##   0.647058823529412   0                  0                 0                 0
##   0.705882352941177   0                  0                 0                 0
##   0.882352941176471   0                  0                 0                 0
##                     classifier_knn
##                      0.235294117647059 0.294117647058824 0.352941176470588
##   0                                  1                 0                 0
##   0.0588235294117647                 1                 0                 0
##   0.117647058823529                  3                 2                 0
##   0.176470588235294                  1                 3                 0
##   0.235294117647059                  2                 3                 1
##   0.294117647058824                  2                 2                 2
##   0.352941176470588                  3                 0                 0
##   0.411764705882353                  2                 1                 1
##   0.470588235294118                  1                 2                 2
##   0.529411764705882                  0                 1                 1
##   0.588235294117647                  0                 0                 1
##   0.647058823529412                  1                 0                 0
##   0.705882352941177                  0                 0                 1
##   0.882352941176471                  0                 0                 0
##                     classifier_knn
##                      0.411764705882353 0.470588235294118 0.529411764705882
##   0                                  0                 0                 0
##   0.0588235294117647                 0                 0                 0
##   0.117647058823529                  0                 0                 0
##   0.176470588235294                  0                 0                 0
##   0.235294117647059                  0                 0                 0
##   0.294117647058824                  0                 0                 0
##   0.352941176470588                  0                 0                 0
##   0.411764705882353                  0                 0                 0
##   0.470588235294118                  0                 0                 0
##   0.529411764705882                  1                 0                 0
##   0.588235294117647                  0                 0                 0
##   0.647058823529412                  0                 0                 0
##   0.705882352941177                  1                 0                 0
##   0.882352941176471                  0                 1                 0
##                     classifier_knn
##                      0.588235294117647 0.647058823529412 0.705882352941177
##   0                                  0                 0                 0
##   0.0588235294117647                 0                 0                 0
##   0.117647058823529                  0                 0                 0
##   0.176470588235294                  0                 0                 0
##   0.235294117647059                  0                 0                 0
##   0.294117647058824                  0                 0                 0
##   0.352941176470588                  0                 0                 0
##   0.411764705882353                  0                 0                 0
##   0.470588235294118                  0                 0                 0
##   0.529411764705882                  0                 0                 0
##   0.588235294117647                  0                 0                 0
##   0.647058823529412                  0                 0                 0
##   0.705882352941177                  0                 0                 0
##   0.882352941176471                  0                 0                 0
##                     classifier_knn
##                      0.823529411764706  1
##   0                                  0  0
##   0.0588235294117647                 0  0
##   0.117647058823529                  0  0
##   0.176470588235294                  0  0
##   0.235294117647059                  0  0
##   0.294117647058824                  0  0
##   0.352941176470588                  0  0
##   0.411764705882353                  0  0
##   0.470588235294118                  0  0
##   0.529411764705882                  0  0
##   0.588235294117647                  0  0
##   0.647058823529412                  0  0
##   0.705882352941177                  0  0
##   0.882352941176471                  0  0
## [1] "Accuracy = 0.225165562913907"

Interpretation & Analysis: We tried various values of k to train the model. The run shown above reached an accuracy of 22.5%, and the highest accuracy we could obtain was 31.7% for k=2, which is very low. We therefore trained a neural network on the dataset next.
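Accuracy is the share of test cases whose predicted label equals the true label. The sketch below shows the calculation on invented labels (the helper name `accuracy` is hypothetical) and then checks the printed figure, which is consistent with 34 matching cells on the diagonal of the table out of 151 test cases.

```r
# Accuracy: fraction of predictions that equal the true label.
accuracy <- function(actual, predicted) {
  mean(as.character(actual) == as.character(predicted))
}

# Hypothetical labels for illustration only.
actual    <- c(0, 1, 1, 0, 1)
predicted <- c(0, 1, 0, 0, 0)
toy_acc   <- accuracy(actual, predicted)
print(toy_acc)

# Arithmetic check of the printed accuracy: 34 matches out of 151 cases.
printed_acc <- 34 / 151
print(printed_acc)
```

Comparing labels as character strings avoids surprises when the true and predicted labels are factors with different level sets, as in the non-square table above.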

3.9.5 Neural network

##           [,1]
## [1,] 0.5086503

3.9.6 SVM

##  Setting default kernel parameters
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 507 
## 
## Objective Function Value : -422.336 
## Training error : 1.213125
## agreement
## FALSE 
##     1

Interpretation & Analysis: Unfortunately, this model performs worse than the previous one. The SVM may have performed poorly because the dataset is noisy, i.e., the target classes overlap. Another possibility is that the dataset has linear features.

3.9.7 Linear Models (Logistic Regression)

##              age       gender_num      sapsi_first          chf_flg 
##       0.23401548       0.49976343       0.15878730       0.37410439 
##       censor_flg        renal_flg        wbc_first        hgb_first 
##       0.49167711       0.22400212       0.06767688       0.13801630 
##      icu_los_day hospital_los_day 
##       0.13859862       0.08316255
## 
## Call:
## glm(formula = censor_flg ~ age + chf_flg + sapsi_first + icu_los_day, 
##     family = "binomial", data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4585  -1.0067   0.4100   0.9296   1.8380  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.7184     0.3748   9.922   <2e-16 ***
## age          -4.6337     0.5304  -8.736   <2e-16 ***
## chf_flg      -0.1168     0.2437  -0.479   0.6317    
## sapsi_first  -1.6064     0.5870  -2.737   0.0062 ** 
## icu_los_day  -0.9380     0.6642  -1.412   0.1579    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 820.71  on 612  degrees of freedom
## Residual deviance: 684.09  on 608  degrees of freedom
## AIC: 694.09
## 
## Number of Fisher Scoring iterations: 4
#calculate the predicted probability of censor_flg = 1 for each individual in the test dataset
predicted <- predict(mylogit, testing, type="response")

#calculate AUC
library(pROC)
auc(testing$censor_flg, predicted)
## Area under the curve: 0.7284

Interpretation & Analysis: A higher area under the curve (AUC) indicates better model performance, that is, a better ability to distinguish between the positive and negative classes. With an AUC of 0.728, the predictions of this model are moderately accurate and can be improved upon. The Akaike information criterion (AIC) is 694.09; when comparing candidate models, a smaller AIC indicates a better fit.
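Both figures can be illustrated without extra packages. The sketch below computes AUC with the rank-based (Mann-Whitney) formula on invented scores; the function name `auc_rank` is hypothetical, and the result may differ from pROC when pROC automatically flips the comparison direction. It then checks the AIC from the summary, using the fact that for a 0/1 response the AIC equals the residual deviance plus twice the number of coefficients.

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case.
auc_rank <- function(labels, scores) {
  r     <- rank(scores)  # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical labels and predicted probabilities.
labels  <- c(0, 0, 1, 1)
scores  <- c(0.10, 0.40, 0.35, 0.80)
toy_auc <- auc_rank(labels, scores)
print(toy_auc)

# AIC check: residual deviance (684.09) + 2 * 5 coefficients.
aic_check <- 684.09 + 2 * 5
print(aic_check)
```

The pairwise-ranking reading of AUC is why a value around 0.73 should not be described as a percentage of correct predictions.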

4. Results

We trained five models: a classification tree (rpart), KNN, a neural network, SVM, and logistic regression. The best-fitting model was logistic regression, with an area under the curve of 0.728. This means that the model ranks a randomly chosen patient with censor_flg = 1 above a randomly chosen patient with censor_flg = 0 about 73% of the time when predicting mortality from age, SAPS score at the time of ICU admission, ICU length of stay, and whether or not the patient has congestive heart failure (chf_flg=1 or chf_flg=0).

5. Discussion

Associations and correlations should have scientific validity. In future analyses of this question, we can refine the model by collecting more data and by selecting features that are representative of the sample, so that the results can reach significance. After internal validation of the model, it is best practice to pilot it in other geographic areas for external validation and to address any discrepancies before rolling it out into the real world. Regulations should ensure that the model is not misused by for-profit agencies to adjust insurance premiums based on health conditions, which could lead to disparities.

6. References

[1] https://www.kidneyfund.org/all-about-kidneys/risk-factors/heart-disease-and-chronickidney-disease-ckd

[2] https://sph.unc.edu/wp-content/uploads/sites/112/2015/07/nciph_ERIC11.pdf

[3] https://en.wikipedia.org/wiki/Bradford_Hill_criteria

[4] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4589117

[5] https://www.tandfonline.com/doi/full/10.1080/10408444.2018.1518404?casa_token=RVeMSLdSSZ8AAAAA%3ADRCVh3shqK6SkEczgp-7q1SHyxLEEkAXpTgA7MUZWwCP3Ag9aajmfF9-DRns82AtZa_gAg-RCvM

[6] https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)

[7] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3760015/

[8] https://link.springer.com/article/10.1007/s00134-005-2763-5

[9] https://www.nature.com/articles/s41598-021-03397-3.pdf?proof=t+target%3D

[10] https://www.frontiersin.org/articles/10.3389/fcvm.2021.774935/full

[11] Schoe A, Bakhshi-Raiez F, de Keizer N, van Dissel JT, de Jonge E. Mortality prediction by SOFA score in ICU-patients after cardiac surgery; comparison with traditional prognostic models. BMC Anesthesiol. (2020) 20:65. doi: 10.1186/s12871-020-00975-2

[12] P. E. Marik, “Management of the critically ill geriatric patient,” Critical Care Medicine, vol. 34, no. 9, pp. S176–S182, 2006

[13] Tang, Y. D., & Katz, S. D. (2006). Anemia in chronic heart failure: prevalence, etiology, clinical correlates, and treatment options. Circulation, 113(20), 2454-2461

[14] Final Stages of Heart Failure: End-Stage Heart Failure. (2020, January 14). Samaritan. https://samaritannj.org/hospice-blog-and-events/hospice-palliative-care-blog/end-stage-heart-failure-what-to-expect/

[15] Aftab Haq, Sachin Patil, Alexis Lanteri Parcells, Ronald S. Chamberlain, “The Simplified Acute Physiology Score III Is Superior to the Simplified Acute Physiology Score II and Acute Physiology and Chronic Health Evaluation II in Predicting Surgical and ICU Mortality in the “Oldest Old””, Current Gerontology and Geriatrics Research, vol. 2014, Article ID 934852, 9 pages, 2014. https://doi.org/10.1155/2014/934852.